{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c3ee8d00",
   "metadata": {},
   "source": [
    "# How to split by character\n",
    "\n",
    ":::info Prerequisites\n",
    "\n",
    "This guide assumes familiarity with the following concepts:\n",
    "\n",
    "- [Text splitters](/docs/concepts#text-splitters)\n",
    "\n",
    ":::\n",
    "\n",
    "This is the simplest method for splitting text. This splits based on a given character sequence, which defaults to `\"\\n\\n\"`. Chunk length is measured by number of characters.\n",
    "\n",
    "1. How the text is split: by single character separator.\n",
    "2. How the chunk size is measured: by number of characters.\n",
    "\n",
    "To obtain the string content directly, use `.splitText()`.\n",
    "\n",
    "To create LangChain [Document](https://v02.api.js.langchain.com/classes/langchain_core_documents.Document.html) objects (e.g., for use in downstream tasks), use `.createDocuments()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "313fb032",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Document {\n",
      "  pageContent: \"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th\"... 839 more characters,\n",
      "  metadata: { loc: { lines: { from: 1, to: 17 } } }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "import { CharacterTextSplitter } from \"@langchain/textsplitters\";\n",
    "import * as fs from \"node:fs\";\n",
    "\n",
    "// Load an example document\n",
    "const rawData = await fs.readFileSync(\"../../../../examples/state_of_the_union.txt\");\n",
    "const stateOfTheUnion = rawData.toString();\n",
    "\n",
    "const textSplitter = new CharacterTextSplitter({\n",
    "    separator: \"\\n\\n\",\n",
    "    chunkSize: 1000,\n",
    "    chunkOverlap: 200,\n",
    "});\n",
    "const texts = await textSplitter.createDocuments([stateOfTheUnion]);\n",
    "console.log(texts[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dadcb9d6",
   "metadata": {},
   "source": [
    "You can also propagate metadata associated with each document to the output chunks:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "1affda60",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Document {\n",
      "  pageContent: \"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th\"... 839 more characters,\n",
      "  metadata: { document: 1, loc: { lines: { from: 1, to: 17 } } }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "const metadatas = [{ document: 1 }, { document: 2 }];\n",
    "\n",
    "const documents = await textSplitter.createDocuments(\n",
    "    [stateOfTheUnion, stateOfTheUnion], metadatas\n",
    ")\n",
    "\n",
    "console.log(documents[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee080e12-6f44-4311-b1ef-302520a41d66",
   "metadata": {},
   "source": [
    "To obtain the string content directly, use `.splitText()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "2a830a9f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\u001b[32m\"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th\"\u001b[39m... 839 more characters"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "const chunks = await textSplitter.splitText(stateOfTheUnion);\n",
    "\n",
    "chunks[0];"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd4dd67a",
   "metadata": {},
   "source": [
    "## Next steps\n",
    "\n",
    "You've now learned a method for splitting text by character.\n",
    "\n",
    "Next, check out a [more advanced way of splitting by character](/docs/how_to/recursive_text_splitter), or the [full tutorial on retrieval-augmented generation](/docs/tutorials/rag)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Deno",
   "language": "typescript",
   "name": "deno"
  },
  "language_info": {
   "file_extension": ".ts",
   "mimetype": "text/x.typescript",
   "name": "typescript",
   "nb_converter": "script",
   "pygments_lexer": "typescript",
   "version": "5.3.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
